10/10/2019
Interpreting 0s and 1s as text…
cat helloworld.txt; echo
## Hello World!
Directly looking at the 0s and 1s…
xxd -b helloworld.txt
## 00000000: 01001000 01100101 01101100 01101100 01101111 00100000  Hello
## 00000006: 01010111 01101111 01110010 01101100 01100100 00100001  World!
01001000 is H? Yes: in the ASCII character encoding, the binary sequence 01001000 corresponds to the decimal value 72, which is mapped to the character 'H'.
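This mapping can be checked directly in R with base functions: strtoi() converts a binary string to an integer, and rawToChar() decodes that integer as an ASCII character.

```r
# convert the binary string to its integer (decimal) value
binary_value <- strtoi("01001000", base = 2)
binary_value
## [1] 72

# interpret 72 as a raw byte and decode it as an ASCII character
rawToChar(as.raw(binary_value))
## [1] "H"
```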
cat hastamanana.txt; echo
## Hasta Ma?ana!
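The question mark in the output is the symptom of a character-encoding mismatch: the ñ in "Mañana" is not part of the 7-bit ASCII set, so a program that assumes the wrong encoding cannot display it correctly. A small base-R illustration of why such characters are fragile:

```r
# 'ñ' is a single character, but it is not in the 7-bit ASCII set;
# in the UTF-8 encoding it is stored as two bytes (c3 b1)
nchar("\u00f1")      # one character...
charToRaw("\u00f1")  # ...encoded as two bytes

# a program that interprets these bytes under a different (mismatched)
# encoding cannot map them to 'ñ' and typically renders '?' instead
```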
Bit, Byte, Word. Figure by Murrell (2009) (licensed under CC BY-NC-SA 3.0 NZ)
We distinguish two basic characteristics:
father mother name     age gender
              John     33  male
              Julia    32  female
John   Julia  Jack     6   male
John   Julia  Jill     4   female
John   Julia  John jnr 2   male
              David    45  male
              Debbie   42  female
David  Debbie Donald   16  male
David  Debbie Dianne   12  female
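A table like this, where empty cells are indicated purely by character position, can be read into R as fixed-width data. The following is a minimal sketch using base R's read.fwf(); the column widths used here are assumptions derived from the layout shown above.

```r
# write a small fixed-width sample (first rows of the table above)
# to a temporary file; the column widths below are assumptions
fwf_lines <- c("father mother name     age gender",
               "              John     33  male",
               "              Julia    32  female",
               "John   Julia  Jack     6   male")
tmp <- tempfile(fileext = ".txt")
writeLines(fwf_lines, tmp)

# read the file, interpreting fixed character positions as columns
fam <- read.fwf(tmp, widths = c(7, 7, 9, 4, 6), skip = 1,
                col.names = c("father", "mother", "name", "age", "gender"),
                strip.white = TRUE)
fam
```

Note that the missing parent cells come back as empty fields, whereas a delimited format such as CSV would need an explicit separator for every column.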
VARIABLE : Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)
FILENAME : ISCCPMonthly_avg.nc
FILEPATH : /usr/local/fer_data/data/
BAD FLAG : -1.E+34
SUBSET : 48 points (TIME)
LONGITUDE: 123.8W(-123.8)
LATITUDE : 48.8S
123.8W
16-JAN-1994 00 9.200012
16-FEB-1994 00 10.70001
16-MAR-1994 00 7.5
16-APR-1994 00 8.100006
<?xml version="1.0"?>
<temperatures>
  <variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
  <filename>ISCCPMonthly_avg.nc</filename>
  <filepath>/usr/local/fer_data/data/</filepath>
  <badflag>-1.E+34</badflag>
  <subset>48 points (TIME)</subset>
  <longitude>123.8W(-123.8)</longitude>
  <latitude>48.8S</latitude>
  <case date="16-JAN-1994" temperature="9.200012" />
  <case date="16-FEB-1994" temperature="10.70001" />
  <case date="16-MAR-1994" temperature="7.5" />
  <case date="16-APR-1994" temperature="8.100006" />
  ...
</temperatures>
The actual content, which we already know from the CSV-type example above, is nested between the <temperatures> tags:
<temperatures> ... </temperatures>
Comparing the actual content between these tags with the csv-type format above, we further recognize that there are two principal ways to link variable names to values.
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
Values can either be placed between an opening and a closing tag, as in <filename>ISCCPMonthly_avg.nc</filename>, or stored as attribute values inside a tag, as in <case date="16-JAN-1994" temperature="9.200012" />.
Attribute-based:
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
Tag-based:
<cases>
  <case>
    <date>16-JAN-1994</date>
    <temperature>9.200012</temperature>
  </case>
  <case>
    <date>16-FEB-1994</date>
    <temperature>10.70001</temperature>
  </case>
  <case>
    <date>16-MAR-1994</date>
    <temperature>7.5</temperature>
  </case>
  <case>
    <date>16-APR-1994</date>
    <temperature>8.100006</temperature>
  </case>
</cases>
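Once parsed, both variants can be queried in R. Below is a minimal sketch using the xml2 package (introduced later in these notes), restricted to the first two cases from the example above: xml_find_all() selects nodes via XPath, and xml_attr() reads attribute values (xml_text() would read tag contents in the tag-based variant).

```r
library(xml2)

# parse a minimal attribute-based document
# (only the first two cases from the example above)
doc <- read_xml('<temperatures>
  <case date="16-JAN-1994" temperature="9.200012" />
  <case date="16-FEB-1994" temperature="10.70001" />
</temperatures>')

# select all <case> nodes via XPath, then read their attributes
cases <- xml_find_all(doc, "//case")
data.frame(date        = xml_attr(cases, "date"),
           temperature = as.numeric(xml_attr(cases, "temperature")))
```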
Note the key differences of storing data in XML format in contrast to a flat, table-like format such as CSV:
Potential drawback of XML: inefficient storage.
XML:
<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <age>25</age>
  <address>
    <streetAddress>21 2nd Street</streetAddress>
    <city>New York</city>
    <state>NY</state>
    <postalCode>10021</postalCode>
  </address>
  <phoneNumber>
    <type>home</type>
    <number>212 555-1234</number>
  </phoneNumber>
  <phoneNumber>
    <type>fax</type>
    <number>646 555-4567</number>
  </phoneNumber>
  <gender>
    <type>male</type>
  </gender>
</person>
JSON:
{"firstName": "John",
 "lastName": "Smith",
 "age": 25,
 "address": {
   "streetAddress": "21 2nd Street",
   "city": "New York",
   "state": "NY",
   "postalCode": "10021"
 },
 "phoneNumber": [
   {
     "type": "home",
     "number": "212 555-1234"
   },
   {
     "type": "fax",
     "number": "646 555-4567"
   }
 ],
 "gender": {
   "type": "male"
 }
}
XML:
<person> <firstName>John</firstName> <lastName>Smith</lastName> </person>
JSON:
{"firstName": "John",
 "lastName": "Smith"}
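The storage overhead can be made concrete by counting characters: every XML field name appears twice (in the opening and the closing tag), whereas JSON states each key only once.

```r
# the same two fields, once as XML and once as JSON
xml_str  <- '<person><firstName>John</firstName><lastName>Smith</lastName></person>'
json_str <- '{"firstName": "John", "lastName": "Smith"}'

# characters needed to store the identical information
nchar(xml_str)
## [1] 70
nchar(json_str)
## [1] 42
```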
The following examples are based on the example code shown above (the two text files persons.json and persons.xml).
# load packages
library(xml2)
# parse XML, represent XML document as R object
xml_doc <- read_xml("persons.xml")
xml_doc
## {xml_document}
## <person>
## [1] <firstName>John</firstName>
## [2] <lastName>Smith</lastName>
## [3] <age>25</age>
## [4] <address>\n <streetAddress>21 2nd Street</streetAddress>\n <city>New York</city>\n <state> ...
## [5] <phoneNumber>\n <type>home</type>\n <number>212 555-1234</number>\n</phoneNumber>
## [6] <phoneNumber>\n <type>fax</type>\n <number>646 555-4567</number>\n</phoneNumber>
## [7] <gender>\n <type>male</type>\n</gender>
# load packages
library(jsonlite)
# parse the JSON-document shown in the example above
json_doc <- fromJSON("persons.json")
# check the structure
str(json_doc)
## List of 6
##  $ firstName  : chr "John"
##  $ lastName   : chr "Smith"
##  $ age        : int 25
##  $ address    :List of 4
##   ..$ streetAddress: chr "21 2nd Street"
##   ..$ city         : chr "New York"
##   ..$ state        : chr "NY"
##   ..$ postalCode   : chr "10021"
##  $ phoneNumber:'data.frame': 2 obs. of 2 variables:
##   ..$ type  : chr [1:2] "home" "fax"
##   ..$ number: chr [1:2] "212 555-1234" "646 555-4567"
##  $ gender     :List of 1
##   ..$ type: chr "male"
HyperText Markup Language (HTML), designed to be read by a web browser.
HTML documents/webpages consist of ‘semi-structured data’:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<h2> hello, world </h2>
</body>
</html>
The head and body elements are nested within the html document (<html>...</html>). In the head (<head>...</head>) we define meta information such as the page title; the body (<body>...</body>) contains the content that the browser actually displays.
HTML (DOM) tree diagram (by Lubaochuan 2014, licensed under the Creative Commons Attribution-Share Alike 4.0 International license).
In this example, we look at Wikipedia’s Economy of Switzerland page.
swiss_econ <- readLines("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
head(swiss_econ)
## [1] "<!DOCTYPE html>"
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"
## [3] "<head>"
## [4] "<meta charset=\"UTF-8\"/>"
## [5] "<title>Economy of Switzerland - Wikipedia</title>"
## [6] "<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,\"$1client-js$2\");RLCONF={\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":!1,\"wgNamespaceNumber\":0,\"wgPageName\":\"Economy_of_Switzerland\",\"wgTitle\":\"Economy of Switzerland\",\"wgCurRevisionId\":913726744,\"wgRevisionId\":913726744,\"wgArticleId\":27465,\"wgIsArticle\":!0,\"wgIsRedirect\":!1,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"CS1 errors: deprecated parameters\",\"CS1 maint: archived copy as title\",\"CS1 German-language sources (de)\",\"Articles with German-language external links\",\"Webarchive template wayback links\",\"Articles with French-language external links\",\"Wikipedia articles needing clarification from January 2019\",\"Articles containing potentially dated statements from 2006\",\"All articles containing potentially dated statements\",\"Articles containing potentially dated statements from 2012\",\"Articles needing more viewpoints from January 2011\","
Search for specific content
line_number <- grep('US Dollar Exchange', swiss_econ)
line_number
## [1] 198
swiss_econ[line_number]
## [1] "<th>US Dollar Exchange"
# install package if not yet installed
# install.packages("rvest")
# load the package
library(rvest)
# parse the webpage, show the content
swiss_econ_parsed <- read_html("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
swiss_econ_parsed
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="U ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Eco ...
Now we can easily separate the data/text from the HTML code. For example, we can extract the HTML table containing the data we are interested in as a data.frame.
tab_node <- html_node(swiss_econ_parsed, xpath = "//*[@id='mw-content-text']/div/table[3]")
tab <- html_table(tab_node)
tab
## Year GDP(in Bil. CHF) GDP per capita(in CHF) GDP growth(real) Inflation rate(in Percent)
## 1 1980 199.3 31,620 5.1 % 4.0 %
## 2 1981 214.0 33,767 1.6 % 6.5 %
## 3 1982 226.5 35,546 −1.3 % 5.7 %
## 4 1983 233.6 36,441 0.6 % 3.0 %
## 5 1984 249.7 38,846 3.1 % 2.9 %
## 6 1985 264.8 41,020 3.7 % 3.4 %
## 7 1986 277.8 42,844 1.9 % 0.7 %
## 8 1987 288.3 44,209 1.6 % 1.4 %
## 9 1988 306.4 46,652 3.3 % 1.9 %
## 10 1989 330.8 49,970 4.4 % 3.2 %
## 11 1990 358.4 53,705 3.6 % 5.4 %
## 12 1991 374.5 55,432 −0.8 % 5.9 %
## 13 1992 381.8 55,808 −0.2 % 4.0 %
## 14 1993 390.3 56,507 −0.1 % 3.2 %
## 15 1994 400.3 57,439 2.4 % 2.7 %
## 16 1995 405.3 57,745 0.5 % 1.8 %
## 17 1996 408.2 57,792 0.6 % 0.8 %
## 18 1997 415.8 58,733 2.3 % 0.5 %
## 19 1998 427.4 60,238 2.9 % 0.0 %
## 20 1999 435.2 61,087 1.7 % 0.8 %
## 21 2000 459.7 64,173 4.0 % 1.6 %
## 22 2001 470.3 65,341 1.3 % 1.0 %
## 23 2002 471.1 64,968 0.2 % 0.6 %
## 24 2003 475.6 65,025 0.1 % 0.6 %
## 25 2004 489.6 66,483 2.6 % 0.8 %
## 26 2005 508.9 68,636 3.2 % 1.2 %
## 27 2006 540.5 72,465 4.1 % 1.1 %
## 28 2007 576.4 76,763 4.1 % 0.7 %
## 29 2008 599.8 78,991 2.1 % 2.4 %
## 30 2009 589.4 76,530 −2.2 % −0.5 %
## 31 2010 608.2 78,121 2.9 % 0.7 %
## 32 2011 621.3 78,946 1.8 % 0.2 %
## 33 2012 626.2 78,723 1.0 % −0.7 %
## 34 2013 638.3 79,404 1.8 % −0.2 %
## 35 2014 649.8 79,827 2.5 % 0.0 %
## 36 2015 653.7 79,346 1.2 % −1.1 %
## 37 2016 659.0 79,137 1.4 % −0.4 %
## 38 2017 668.1 79,357 1.1 % 0.5 %
## Unemployment (in Percent) Government debt(in % of GDP)
## 1 0.2 % k. A.
## 2 0.2 % k. A.
## 3 0.4 % k. A.
## 4 0.9 % k. A.
## 5 1.1 % k. A.
## 6 1.0 % k. A.
## 7 0.8 % k. A.
## 8 0.8 % k. A.
## 9 0.7 % k. A.
## 10 0.6 % k. A.
## 11 0.5 % 34.4 %
## 12 1.0 % 36.1 %
## 13 2.5 % 40.9 %
## 14 4.5 % 46.7 %
## 15 4.7 % 50.1 %
## 16 4.2 % 52.9 %
## 17 4.7 % 54.4 %
## 18 5.2 % 57.2 %
## 19 3.9 % 59.6 %
## 20 2.7 % 55.9 %
## 21 1.8 % 54.7 %
## 22 1.7 % 52.9 %
## 23 2.5 % 59.1 %
## 24 3.7 % 58.2 %
## 25 3.9 % 59.6 %
## 26 3.8 % 56.8 %
## 27 3.3 % 50.5 %
## 28 2.8 % 46.5 %
## 29 2.6 % 46.8 %
## 30 3.7 % 45.2 %
## 31 3.5 % 44.0 %
## 32 2.8 % 44.1 %
## 33 2.9 % 44.7 %
## 34 3.2 % 43.8 %
## 35 3.0 % 43.7 %
## 36 3.2 % 43.6 %
## 37 3.2 % 43.3 %
## 38 3.2 % 42.8 %
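Note that the scraped columns arrive as character strings, complete with thousands separators ("31,620"), percent signs ("5.1 %"), and Unicode minus signs ("−1.3 %"). A small sketch of a base-R cleaning helper (the function name clean_num is ours, not part of rvest):

```r
# helper (our own, not part of rvest): turn strings such as
# "31,620", "5.1 %", or "−1.3 %" into numeric values
clean_num <- function(x) {
  x <- gsub(",", "", x, fixed = TRUE)  # drop thousands separators
  x <- gsub("\u2212", "-", x)          # Unicode minus -> ASCII minus
  x <- gsub("%", "", x, fixed = TRUE)  # drop percent signs
  as.numeric(x)                        # coerce; whitespace is ignored
}

clean_num(c("31,620", "5.1 %", "\u22121.3 %"))
```

Applied column-wise (e.g. with lapply()), this turns the scraped table into data ready for analysis.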
“One way to answer this question is to consider the sum total of data held by all the big online storage and service companies like Google, Amazon, Microsoft and Facebook. Estimates are that the big four store at least 1,200 petabytes between them. That is 1.2 million terabytes (one terabyte is 1,000 gigabytes).” (Gareth Mitchell, ScienceFocus)
Murrell, Paul. 2009. Introduction to Data Technologies. London, UK: CRC Press.